Advanced Neuroscience - Dr. Ali Ghazizade
Hamed Nejat - 96102578
As an example of generalized reinforcement learning, we consider the water maze task. This is a navigation problem in which rats are placed in a large pool of milky water and have to swim around until they find a small platform that is submerged slightly below the surface of the water. The opaqueness of the water prevents them from seeing the platform directly, and their natural aversion to water (although they are competent swimmers) motivates them to find the platform. After several trials, the rats learn the location of the platform and swim directly to it when placed in the water. We are going to simulate a simple model of this navigation problem.

15x15 map
Fixed target
Fixed cat
Random starting point in each trial
4 directions for moving
A movement probability assigned to each of the 4 directions
Uniform probabilities at first
Step-by-step movement
A trial ends when the rat reaches either the cat or the target
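To make the setup concrete, here is a minimal sketch of an environment with these rules. The class name `SimpleMaze` and the target/cat coordinates are illustrative assumptions, not the actual `WaterMaze` implementation:

```python
import numpy as np

class SimpleMaze:
    # Four movement directions: up, down, left, right.
    MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]

    def __init__(self, size=15, target=(2, 12), cat=(10, 4), seed=0):
        self.size, self.target, self.cat = size, target, cat
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        # Random starting point each trial, avoiding the terminal cells.
        while True:
            pos = tuple(self.rng.integers(0, self.size, 2))
            if pos != self.target and pos != self.cat:
                self.pos = pos
                return pos

    def step(self, action):
        # Move one cell; the walls clip the position. A trial ends on
        # reaching the target (+1) or the cat (-1).
        dr, dc = self.MOVES[action]
        r = min(max(self.pos[0] + dr, 0), self.size - 1)
        c = min(max(self.pos[1] + dc, 0), self.size - 1)
        self.pos = (r, c)
        if self.pos == self.target:
            return self.pos, 1.0, True
        if self.pos == self.cat:
            return self.pos, -1.0, True
        return self.pos, 0.0, False
```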
For a cleaner implementation, we've wrapped these parameters in a class called "WaterMaze". Don't forget to uncomment the lines below to download my library (a Python file) from my GitHub repository:
!pip install wget
import wget
import warnings
warnings.filterwarnings('ignore')
wget.download('https://raw.githubusercontent.com/HNXJ/AdvNeuroscience/master/Models.py')
from Models import WaterMaze, WaterMazeT
from matplotlib import pyplot as plt
import numpy as np
import time
In this part, we plot both the paths and the value contours produced during training, for each scenario and parameter setting.
In this scheme, each path's likelihood is increased by epsilon if it reaches the target, and decreased otherwise. The cat is rarely reached because, after a few trials, the mouse finds the target; since there is no stochasticity, it does not move toward lower-likelihood cells, and a few "beneficial paths" form: shortcuts to the target with low complexity.
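The per-trial update described above can be sketched as follows; `update_path_values` is a hypothetical helper, not the actual `WaterMaze` internals:

```python
import numpy as np

def update_path_values(V, path, reached_target, eps=0.1):
    """V: 2D array of per-cell values; path: list of (row, col) cells.

    Every cell visited on the path is nudged up by eps if the trial
    ended at the target, and down by eps otherwise.
    """
    delta = eps if reached_target else -eps
    for (r, c) in path:
        V[r, c] += delta
    return V
```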
w1 = WaterMaze(eps=0.1)
w1.runSession(epochs=500, maxlen=100, animate=False, mode="argmax", epsl=0.0, lambd=1.0)
figs, plts = plt.subplots(figsize=(15, 15))
w1.animate(ax=plts)
w1.plotpath(ax=plts, title="Last session (500) with an example of path", paths=w1.paths)
k = 50
fig, ax = plt.subplots(figsize=(20, 20), nrows=3, ncols=3)
for i in range(9):
    ax[int(i/3), i%3].imshow(w1.logs[i*k].maze)
    ax[int(i/3), i%3].set_title("Maze expected reward colormap in session no." + str(i*k))
    w1.plotpath(ax=ax[int(i/3), i%3], title="Maze expected reward colormap in session no." + str(i*k), paths=w1.logs[i*k].path)
Here we apply the same policy, but with exploration probability 0.03:
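The exploration probability can be read as an epsilon-greedy choice: with probability `epsl` a random direction is taken, otherwise the highest-valued one. A minimal sketch (the function name is illustrative, not the library API):

```python
import numpy as np

def choose_action(values, epsl=0.03, rng=None):
    """values: expected value of each of the 4 directions."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < epsl:
        return int(rng.integers(len(values)))   # explore: random direction
    return int(np.argmax(values))               # exploit: best direction
```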
w2 = WaterMaze(eps=0.1)
w2.runSession(epochs=200, maxlen=100, animate=False, mode="argmax", epsl=0.03, lambd=1.0)
figs, plts = plt.subplots(figsize=(15, 15))
w2.animate(ax=plts)
w2.plotpath(ax=plts, title="Last session (200) with an example of path", paths=w2.paths)
k = 20
fig, ax = plt.subplots(figsize=(20, 20), nrows=3, ncols=3)
for i in range(9):
    ax[int(i/3), i%3].imshow(w2.logs[i*k].maze)
    ax[int(i/3), i%3].set_title("Maze expected reward colormap in session no." + str(i*k))
    w2.plotpath(ax=ax[int(i/3), i%3], title="Maze expected reward colormap in session no." + str(i*k), paths=w2.logs[i*k].path)
As before, each path's likelihood is increased by epsilon if it reaches the target and decreased otherwise; here, however, actions are chosen probabilistically rather than by argmax. The cat is still rarely reached after the first few trials.
w3 = WaterMaze(eps=0.2)
w3.runSession(epochs=400, maxlen=100, animate=False, mode="probabilistic", epsl=0.0, lambd=1.0)
figs, plts = plt.subplots(figsize=(15, 15))
w3.animate(ax=plts)
w3.plotpath(ax=plts, title="Last session (400) with an example of path", paths=w3.paths)
k = 40
fig, ax = plt.subplots(figsize=(20, 20), nrows=3, ncols=3)
for i in range(9):
    ax[int(i/3), i%3].imshow(w3.logs[i*k].maze)
    ax[int(i/3), i%3].set_title("Maze expected reward colormap in session no." + str(i*k))
    w3.plotpath(ax=ax[int(i/3), i%3], title="Maze expected reward colormap in session no." + str(i*k), paths=w3.logs[i*k].path)
w4 = WaterMaze(eps=0.2)
w4.runSession(epochs=400, maxlen=100, animate=False, mode="probabilistic", epsl=0.03, lambd=1.0)
figs, plts = plt.subplots(figsize=(15, 15))
w4.animate(ax=plts)
w4.plotpath(ax=plts, title="Last session (400) with an example of path", paths=w4.paths)
k = 40
fig, ax = plt.subplots(figsize=(20, 20), nrows=3, ncols=3)
for i in range(9):
    ax[int(i/3), i%3].imshow(w4.logs[i*k].maze)
    ax[int(i/3), i%3].set_title("Maze expected reward colormap in session no." + str(i*k))
    w4.plotpath(ax=ax[int(i/3), i%3], title="Maze expected reward colormap in session no." + str(i*k), paths=w4.logs[i*k].path)
In this method, we change the update step by adding a temporal-difference (TD) term to the update equation: $$ V_{t+1}(s_{t-k}) := V_{t}(s_{t-k}) + \lambda^{k}\left(V_{t+1}(s_{t}) - V_{t}(s_{t})\right) $$
The second term is the temporal-difference (TD) term: the change in the value of the current state is propagated back to the state visited $k$ steps earlier, weighted by $\lambda^{k}$. The special case $\lambda = 0$, where nothing is propagated to earlier states, is roughly equivalent to the Rescorla-Wagner (R-W) method.
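A backward pass implementing this kind of update can be sketched as below. This is an illustrative reading of the equation, not the actual `runSessionTD` code: the discount is omitted for brevity and `eps` plays the role of the learning rate.

```python
def td_lambda_update(V, path, rewards, eps=0.1, lam=0.8):
    """V: mapping from state to value; path: states s_0..s_T;
    rewards: per-step rewards (len(path) - 1 of them)."""
    for t in range(len(path) - 1):
        s, s_next = path[t], path[t + 1]
        # TD error at the current step (discount omitted).
        delta = rewards[t] + V[s_next] - V[s]
        # Propagate it back along the path, decayed by lam**k.
        for k in range(t + 1):
            V[path[t - k]] += eps * (lam ** k) * delta
    return V
```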
w5 = WaterMaze(eps=0.2)
w5.runSessionTD(epochs=400, maxlen=100, animate=False, mode="probabilistic", epsl=0.03, lambds=1.0)
figs, plts = plt.subplots(figsize=(15, 15))
w5.animate(ax=plts)
w5.plotpath(ax=plts, title="Last session (400) with an example of path", paths=w5.paths)
k = 40
fig, ax = plt.subplots(figsize=(20, 20), nrows=3, ncols=3)
for i in range(9):
    ax[int(i/3), i%3].imshow(w5.logs[i*k].maze)
    ax[int(i/3), i%3].set_title("Maze expected reward colormap in session no." + str(i*k))
    w5.plotpath(ax=ax[int(i/3), i%3], title="Maze expected reward colormap in session no." + str(i*k), paths=w5.logs[i*k].path)
In this part, we vary these values for our models under different policies. As known from learning theory, low learning rates in iterative models give slow convergence; higher rates speed convergence up at first, but very high learning rates cause oscillation, as the state jumps between semi-optimal solutions and may never reach the optimum.
We expect that increasing $\epsilon$ initially increases convergence speed, but beyond some value it leads to large, uncontrolled changes in the model's predictions.
epsil = np.logspace(-2, 0, 9)
fig, ax = plt.subplots(figsize=(20, 10), ncols=3, nrows=3)
for i in range(3):
    for j in range(3):
        w = WaterMaze(eps=epsil[i*3 + j])
        w.runSession(epochs=500, maxlen=50, animate=False, mode="argmax", epsl=0.0, lambd=1.0)
        w.errPlotter(ax=ax[i, j], title="Mean performance for epsilon = {:.2f}".format(epsil[i*3 + j]))
epsil = np.logspace(-2, 0, 9)
fig, ax = plt.subplots(figsize=(20, 10), ncols=3, nrows=3)
for i in range(3):
    for j in range(3):
        w = WaterMaze(eps=epsil[i*3 + j])
        w.runSession(epochs=4000, maxlen=50, animate=False, mode="probabilistic", epsl=0.0, lambd=1.0)
        w.errPlotter(ax=ax[i, j], title="Mean performance for epsilon = {:.2f}".format(epsil[i*3 + j]))
The effect of epsilon in this scenario is almost the same as in the first scenario.
epsil = np.logspace(-2, 0, 9)
fig, ax = plt.subplots(figsize=(20, 10), ncols=3, nrows=3)
for i in range(3):
    for j in range(3):
        w = WaterMaze(eps=epsil[i*3 + j])
        w.runSession(epochs=1000, maxlen=30, animate=False, mode="argmax", epsl=0.1, lambd=1.0)
        w.errPlotter(ax=ax[i, j], title="Mean performance for epsilon = {:.2f}".format(epsil[i*3 + j]))
For lower epsilons, accuracy increases roughly linearly, meaning epsilon is too low $(\epsilon < 0.03)$; for values around 0.1, convergence is close to optimal on average.
epsil = np.logspace(-2, 0, 9)
fig, ax = plt.subplots(figsize=(20, 10), ncols=3, nrows=3)
for i in range(3):
    for j in range(3):
        w = WaterMaze(eps=epsil[i*3 + j])
        w.runSessionTD(epochs=1000, maxlen=100, animate=False, mode="argmax", epsl=0.0, lambds=0.8)
        w.errPlotter(ax=ax[i, j], title="Mean performance for epsilon = {:.2f}".format(epsil[i*3 + j]))
epsil = np.logspace(-2, -0.1, 9)
fig, ax = plt.subplots(figsize=(20, 10), ncols=3, nrows=3)
for i in range(3):
    for j in range(3):
        w = WaterMaze(eps=epsil[i*3 + j])
        w.runSessionTD(epochs=1000, maxlen=30, animate=False, mode="argmax", epsl=0.03, lambds=0.8)
        w.errPlotter(ax=ax[i, j], title="Mean performance for epsilon = {:.2f}".format(epsil[i*3 + j]))
A lower discount factor causes smaller changes in the distant parts of the path, so reward propagation is weaker and we expect slower training. This is what we observe: higher discount factors (near 1) improve the training procedure and reduce oscillation in accuracy.
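The propagation claim can be made concrete: a state visited $k$ steps before the end receives weight $\gamma^k$, so small $\gamma$ concentrates learning near the terminal state. A tiny illustrative helper (not part of the library):

```python
def propagation_weights(gamma, n):
    """Weight received by a state visited k steps before the end, k = 0..n-1."""
    return [gamma ** k for k in range(n)]

# With gamma = 0.1 the weight 3 steps back is 0.001; with gamma = 0.9 it
# is 0.729, so distant states learn far more with a high discount factor.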
gammas = np.logspace(-2, 0, 9)
fig, ax = plt.subplots(figsize=(20, 10), ncols=3, nrows=3)
for i in range(3):
    for j in range(3):
        w = WaterMaze(eps=0.1)
        w.runSession(epochs=1000, maxlen=30, animate=False, mode="argmax", epsl=0.0, lambd=gammas[i*3 + j])
        w.errPlotter(ax=ax[i, j], title="Mean performance for gamma = {:.2f}".format(gammas[i*3 + j]))
The stochastic scenario in this MDP performs poorly: the map is not fully explored, and the stochasticity causes incorrect updates in the reward map.
gammas = np.logspace(-2, 0, 9)
fig, ax = plt.subplots(figsize=(20, 10), ncols=3, nrows=3)
for i in range(3):
    for j in range(3):
        w = WaterMaze(eps=0.1)
        w.runSession(epochs=400, maxlen=30, animate=False, mode="probabilistic", epsl=0.0, lambd=gammas[i*3 + j])
        w.errPlotter(ax=ax[i, j], title="Mean performance for gamma = {:.2f}".format(gammas[i*3 + j]))
gammas = np.logspace(-2, 0, 9)
fig, ax = plt.subplots(figsize=(20, 10), ncols=3, nrows=3)
for i in range(3):
    for j in range(3):
        w = WaterMaze(eps=0.1)
        w.runSession(epochs=1000, maxlen=30, animate=False, mode="probabilistic", epsl=0.05, lambd=gammas[i*3 + j])
        w.errPlotter(ax=ax[i, j], title="Mean performance for gamma = {:.2f}".format(gammas[i*3 + j]))
In deterministic mode, TD($\lambda$) converges fastest, and the optimal lambda is about 0.1; this is probably the best error-propagation rate when combined with the optimal epsilon of 0.3.
lambds = np.logspace(-2, 0, 9)
fig, ax = plt.subplots(figsize=(20, 10), ncols=3, nrows=3)
for i in range(3):
    for j in range(3):
        w = WaterMaze(eps=0.3)
        w.runSessionTD(epochs=500, maxlen=30, animate=False, mode="argmax", epsl=0.0, lambds=lambds[i*3 + j])
        w.errPlotter(ax=ax[i, j], title="Mean performance for lambda = {:.2f}".format(lambds[i*3 + j]))
Stochastic TD with exploration has the same problem as the previous stochastic scenario. Since accuracy is computed over the whole map, stochasticity leads to poor training in the region near the punishment and good training near the targets, so the expected accuracy does not approach 1; here it is about 0.6 at its optimum.
lambds = np.logspace(-2, 0, 9)
fig, ax = plt.subplots(figsize=(20, 10), ncols=3, nrows=3)
for i in range(3):
    for j in range(3):
        w = WaterMaze(eps=0.3)
        w.runSessionTD(epochs=1000, maxlen=30, animate=False, mode="probabilistic", epsl=0.05, lambds=lambds[i*3 + j])
        w.errPlotter(ax=ax[i, j], title="Mean performance for lambda = {:.2f}".format(lambds[i*3 + j]))
In this part, we change the map to one with two targets, keeping the previous options, and then vary the main parameters.
In all parts (except stochastic mode), the convergence speed increases as $\epsilon$ rises from 0 to about 0.3, then oscillates for $\epsilon \gtrsim 0.4$. Convergence does not differ from the previous part, but there is one change: expected rewards near the higher-valued target are larger than near the lower one, since reward propagation is stronger there.
w1 = WaterMazeT(eps=0.1)
w1.runSession(epochs=1000, maxlen=100, animate=False, mode="argmax", epsl=0.0, lambd=1.0)
figs, plts = plt.subplots(figsize=(15, 15))
w1.animate(ax=plts)
w1.plotpath(ax=plts, title="Last session (1000) with an example of path", paths=w1.paths)
k = 60
fig, ax = plt.subplots(figsize=(20, 20), nrows=3, ncols=3)
for i in range(9):
    ax[int(i/3), i%3].imshow(w1.logs[i*k].maze)
    ax[int(i/3), i%3].set_title("Maze expected reward colormap in session no." + str(i*k))
    w1.plotpath(ax=ax[int(i/3), i%3], title="Maze expected reward colormap in session no." + str(i*k), paths=w1.logs[i*k].path)
epsil = np.logspace(-2, 0, 9)
fig, ax = plt.subplots(figsize=(20, 10), ncols=3, nrows=3)
for i in range(3):
    for j in range(3):
        w = WaterMazeT(eps=epsil[i*3 + j])
        w.runSession(epochs=500, maxlen=50, animate=False, mode="argmax", epsl=0.0, lambd=1.0)
        w.errPlotter(ax=ax[i, j], title="Mean performance for epsilon = {:.2f}".format(epsil[i*3 + j]))
gammas = np.logspace(-2, 0, 9)
fig, ax = plt.subplots(figsize=(20, 10), ncols=3, nrows=3)
for i in range(3):
    for j in range(3):
        w = WaterMazeT(eps=0.1)
        w.runSession(epochs=1000, maxlen=30, animate=False, mode="argmax", epsl=0.0, lambd=gammas[i*3 + j])
        w.errPlotter(ax=ax[i, j], title="Mean performance for gamma = {:.2f}".format(gammas[i*3 + j]))
The stochastic approach gives almost no guarantee of good decisions: since the training phase is stochastic, parts of the map far from the targets are poorly trained, and the expected reward is badly scaled near the punishment.
w2 = WaterMazeT(eps=0.1)
w2.runSession(epochs=200, maxlen=100, animate=False, mode="argmax", epsl=0.03, lambd=1.0)
figs, plts = plt.subplots(figsize=(15, 15))
w2.animate(ax=plts)
w2.plotpath(ax=plts, title="Last session (200) with an example of path", paths=w2.paths)
k = 20
fig, ax = plt.subplots(figsize=(20, 20), nrows=3, ncols=3)
for i in range(9):
    ax[int(i/3), i%3].imshow(w2.logs[i*k].maze)
    ax[int(i/3), i%3].set_title("Maze expected reward colormap in session no." + str(i*k))
    w2.plotpath(ax=ax[int(i/3), i%3], title="Maze expected reward colormap in session no." + str(i*k), paths=w2.logs[i*k].path)
epsil = np.logspace(-2, 0, 9)
fig, ax = plt.subplots(figsize=(20, 10), ncols=3, nrows=3)
for i in range(3):
    for j in range(3):
        w = WaterMazeT(eps=epsil[i*3 + j])
        w.runSession(epochs=1000, maxlen=50, animate=False, mode="probabilistic", epsl=0.0, lambd=1.0)
        w.errPlotter(ax=ax[i, j], title="Mean performance for epsilon = {:.2f}".format(epsil[i*3 + j]))
gammas = np.logspace(-2, 0, 9)
fig, ax = plt.subplots(figsize=(20, 10), ncols=3, nrows=3)
for i in range(3):
    for j in range(3):
        w = WaterMazeT(eps=0.1)
        w.runSession(epochs=1000, maxlen=30, animate=False, mode="probabilistic", epsl=0.0, lambd=gammas[i*3 + j])
        w.errPlotter(ax=ax[i, j], title="Mean performance for gamma = {:.2f}".format(gammas[i*3 + j]))
w4 = WaterMazeT(eps=0.2)
w4.runSession(epochs=400, maxlen=100, animate=False, mode="probabilistic", epsl=0.03, lambd=1.0)
figs, plts = plt.subplots(figsize=(15, 15))
w4.animate(ax=plts)
w4.plotpath(ax=plts, title="Last session (400) with an example of path", paths=w4.paths)
k = 40
fig, ax = plt.subplots(figsize=(20, 20), nrows=3, ncols=3)
for i in range(9):
    ax[int(i/3), i%3].imshow(w4.logs[i*k].maze)
    ax[int(i/3), i%3].set_title("Maze expected reward colormap in session no." + str(i*k))
    w4.plotpath(ax=ax[int(i/3), i%3], title="Maze expected reward colormap in session no." + str(i*k), paths=w4.logs[i*k].path)
epsil = np.logspace(-2, 0, 9)
fig, ax = plt.subplots(figsize=(20, 10), ncols=3, nrows=3)
for i in range(3):
    for j in range(3):
        w = WaterMazeT(eps=epsil[i*3 + j])
        w.runSession(epochs=1000, maxlen=30, animate=False, mode="probabilistic", epsl=0.1, lambd=1.0)
        w.errPlotter(ax=ax[i, j], title="Mean performance for epsilon = {:.2f}".format(epsil[i*3 + j]))
gammas = np.logspace(-2, 0, 9)
fig, ax = plt.subplots(figsize=(20, 10), ncols=3, nrows=3)
for i in range(3):
    for j in range(3):
        w = WaterMazeT(eps=0.1)
        w.runSession(epochs=1000, maxlen=30, animate=False, mode="probabilistic", epsl=0.05, lambd=gammas[i*3 + j])
        w.errPlotter(ax=ax[i, j], title="Mean performance for gamma = {:.2f}".format(gammas[i*3 + j]))
In deterministic TD, everything is similar to the one-target mode, but since there is no transition stochasticity and paths are chosen by highest probability, the error rate becomes very low after some sessions, as we can see here.
w5 = WaterMazeT(eps=0.2)
w5.runSessionTD(epochs=400, maxlen=100, animate=False, mode="argmax", epsl=0.0, lambds=1.0)
figs, plts = plt.subplots(figsize=(15, 15))
w5.animate(ax=plts)
w5.plotpath(ax=plts, title="Last session (400) with an example of path", paths=w5.paths)
k = 40
fig, ax = plt.subplots(figsize=(20, 20), nrows=3, ncols=3)
for i in range(9):
    ax[int(i/3), i%3].imshow(w5.logs[i*k].maze)
    ax[int(i/3), i%3].set_title("Maze expected reward colormap in session no." + str(i*k))
    w5.plotpath(ax=ax[int(i/3), i%3], title="Maze expected reward colormap in session no." + str(i*k), paths=w5.logs[i*k].path)
epsil = np.logspace(-2, 0, 9)
fig, ax = plt.subplots(figsize=(20, 10), ncols=3, nrows=3)
for i in range(3):
    for j in range(3):
        w = WaterMazeT(eps=epsil[i*3 + j])
        w.runSessionTD(epochs=1000, maxlen=100, animate=False, mode="argmax", epsl=0.0, lambds=0.8)
        w.errPlotter(ax=ax[i, j], title="Mean performance for epsilon = {:.2f}".format(epsil[i*3 + j]))
lambds = np.logspace(-2, 0, 9)
fig, ax = plt.subplots(figsize=(20, 10), ncols=3, nrows=3)
for i in range(3):
    for j in range(3):
        w = WaterMazeT(eps=0.5)
        w.runSessionTD(epochs=1000, maxlen=30, animate=False, mode="argmax", epsl=0.0, lambds=lambds[i*3 + j])
        w.errPlotter(ax=ax[i, j], title="Mean performance for lambda = {:.2f}".format(lambds[i*3 + j]))
Again, as expected, the value contour is higher near the targets and lower near the punishment, even with two targets; the other notable observation is that the values near the higher-valued target are larger.
w5 = WaterMazeT(eps=0.2)
w5.runSessionTD(epochs=400, maxlen=100, animate=False, mode="probabilistic", epsl=0.03, lambds=1.0)
figs, plts = plt.subplots(figsize=(15, 15))
w5.animate(ax=plts)
w5.plotpath(ax=plts, title="Last session (400) with an example of path", paths=w5.paths)
k = 40
fig, ax = plt.subplots(figsize=(20, 20), nrows=3, ncols=3)
for i in range(9):
    ax[int(i/3), i%3].imshow(w5.logs[i*k].maze)
    ax[int(i/3), i%3].set_title("Maze expected reward colormap in session no." + str(i*k))
    w5.plotpath(ax=ax[int(i/3), i%3], title="Maze expected reward colormap in session no." + str(i*k), paths=w5.logs[i*k].path)
epsil = np.logspace(-2, 0, 9)
fig, ax = plt.subplots(figsize=(20, 10), ncols=3, nrows=3)
for i in range(3):
    for j in range(3):
        w = WaterMazeT(eps=epsil[i*3 + j])
        w.runSessionTD(epochs=1000, maxlen=30, animate=False, mode="argmax", epsl=0.1, lambds=0.8)
        w.errPlotter(ax=ax[i, j], title="Mean performance for epsilon = {:.2f}".format(epsil[i*3 + j]))
lambds = np.logspace(-2, 0, 9)
fig, ax = plt.subplots(figsize=(20, 10), ncols=3, nrows=3)
for i in range(3):
    for j in range(3):
        w = WaterMazeT(eps=0.5)
        w.runSessionTD(epochs=1000, maxlen=30, animate=False, mode="probabilistic", epsl=0.05, lambds=lambds[i*3 + j])
        w.errPlotter(ax=ax[i, j], title="Mean performance for lambda = {:.2f}".format(lambds[i*3 + j]))
In the TD-lambda approach, as in the others, increasing $\epsilon$ from near 0 toward 1 first increases learning (convergence) speed and then becomes unstable and oscillatory; this is likely due to larger value-update steps, while lower learning rates naturally produce smaller changes and slower convergence.